Abstract

In this paper we present a ‘stability theorem’ for stochastic approximation (SA) algorithms with ‘controlled Markov’ noise. Such algorithms were first studied by Borkar in 2006. Specifically, sufficient conditions are presented which guarantee the stability of the iterates. Further, under these conditions the iterates are shown to track a solution to the differential inclusion defined in terms of the ergodic occupation measures associated with the ‘controlled Markov’ process. As an application of our main result we present an improvement to a general form of temporal difference learning algorithms. Specifically, we present sufficient conditions for their stability and convergence using our framework. This paper builds on the works of Borkar and of Benveniste, Metivier and Priouret.


1 Introduction 

Let us begin by considering the general form of stochastic approximation algorithms:

$x_{n+1} = x_n + a(n)\,(h(x_n) + M_{n+1})$, where (1)

(i) $h : \mathbb{R}^d \to \mathbb{R}^d$ is a Lipschitz continuous function;

(ii) $\{a(n)\}_{n \ge 0}$ is the given step-size sequence such that $\sum_{n=0}^{\infty} a(n) = \infty$ and $\sum_{n=0}^{\infty} a(n)^2 < \infty$;

(iii) $\{M_n\}_{n \ge 1}$ is the sequence of square integrable martingale difference terms.

In 1996, Benaïm showed that the asymptotic behavior of recursion (1) can be determined by studying the asymptotic behavior of the associated o.d.e.

$\dot{x}(t) = h(x(t))$.
This technique is popularly known as the ODE method and was originally developed by Ljung in 1977 [S]. In [3] it is assumed that $\sup_{n \ge 0} \|x_n\| < \infty$ a.s., in other words the iterates are assumed to be stable. In many cases the stability assumption becomes a bottleneck in using the ODE method. This bottleneck was overcome by Borkar and Meyn in 1999 [5]. Specifically, they developed sufficient conditions that guarantee the ‘stability and convergence’ of recursion (1).
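As a rough illustration of recursion (1) and of the ODE method, the following sketch runs the iteration for one hypothetical choice of ingredients: $h(x) = -x$, $a(n) = 1/(n+1)$ and i.i.d. standard Gaussian martingale differences. None of these choices come from the paper; they are assumptions made only for the example. The associated o.d.e. $\dot{x}(t) = -x(t)$ has 0 as its globally asymptotically stable equilibrium, and the iterates settle near it.

```python
import random


def run_sa(num_steps, x0=5.0, seed=0):
    """Iterate x_{n+1} = x_n + a(n) * (h(x_n) + M_{n+1}).

    Illustrative assumptions (not from the paper): h(x) = -x,
    a(n) = 1/(n+1), and M_{n+1} i.i.d. standard Gaussian.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(num_steps):
        step = 1.0 / (n + 1)         # sum a(n) = inf, sum a(n)^2 < inf
        noise = rng.gauss(0.0, 1.0)  # martingale difference term
        x = x + step * (-x + noise)
    return x
```

With these particular choices the iterate after N steps is simply the average of the first N noise terms, so its typical size is of order $1/\sqrt{N}$.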

In many applications, the noise-process is Markovian in nature. Stochastic approximation algorithms with ‘Markov Noise’ have been extensively studied in Benveniste et al. [S]. These results have been extended to the case when the noise is ‘controlled Markov’ by Borkar [6]. Specifically, the asymptotics of the iterates are described via a limiting differential inclusion (DI) that is defined in terms of the ergodic occupation measures of the Markov process. As explained in [6], the motivation for such a study stems from the fact that in many cases the noise-process is not Markov, but its lack of Markov property comes through its dependence on a time-varying ‘control’ process. In particular this is the case with many reinforcement learning algorithms. In [6], the iterates are assumed to be stable, which as explained earlier poses a bottleneck, especially in analyzing algorithms from reinforcement learning. The aim of this paper is to overcome this bottleneck. In other words, we present sufficient conditions for the ‘stability and convergence’ of stochastic approximation algorithms with ‘controlled Markov’ noise. Finally, as an application setting, we consider a general form of the temporal difference learning algorithms in reinforcement learning and present weaker sufficient conditions (than those in the literature) that guarantee their stability and convergence using our framework.

The organization of this paper is as follows:

In Section 2.1 we present the definitions and notations involved in this paper. In Section 2.2 we discuss the assumptions involved in proving the stability of the iterates given by (3).

In Section 3 we show the stability of the iterates under the assumptions outlined in Section 2.2 (Theorem 1).

In Section 4 we present additional assumptions which, coupled with the assumptions from Section 2.2, are used to prove the ‘stability and convergence’ of recursion (3) (Theorem 2). Specifically, Theorem 2 states that under the aforementioned sets of assumptions the iterates are stable and converge to an internally chain transitive invariant set associated with $\dot{x}(t) \in \hat{h}(x(t))$. For the definition of $\hat{h}$ the reader is referred to Section 4.

In Section 5 we discuss an application of Theorem 2. We present sufficient conditions for the ‘stability and convergence’ of a general form of temporal difference learning algorithms in reinforcement learning.
2 Preliminaries and Assumptions

2.1 Notations & Definitions

In this section we present the definitions and notations used in this paper for the purpose of easy reference. Note that they can be found in Benaïm et al. [1], Aubin et al. [1], [2] and Borkar [7].

Marchaud Map: A set-valued map $h : \mathbb{R}^n \to \{\text{subsets of } \mathbb{R}^m\}$ is called a Marchaud map if it satisfies the following properties:

(i) For each $x \in \mathbb{R}^n$, $h(x)$ is convex and compact.

(ii) (point-wise boundedness) For each $x \in \mathbb{R}^n$, $\sup_{w \in h(x)} \|w\| < K(1 + \|x\|)$ for some $K > 0$.

(iii) $h$ is an upper-semicontinuous map.

We say that $h$ is upper-semicontinuous if, given sequences $\{x_n\}_{n \ge 1}$ (in $\mathbb{R}^n$) and $\{y_n\}_{n \ge 1}$ (in $\mathbb{R}^m$) with $x_n \to x$, $y_n \to y$ and $y_n \in h(x_n)$, $n \ge 1$, we have $y \in h(x)$. In other words, the graph of $h$, $\{(x, y) : y \in h(x),\ x \in \mathbb{R}^n\}$, is closed in $\mathbb{R}^n \times \mathbb{R}^m$.

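If the set-valued map $H : \mathbb{R}^d \to \{\text{subsets of } \mathbb{R}^d\}$ is Marchaud, then the differential inclusion (DI) given by

$\dot{x}(t) \in H(x(t))$ (2)

is guaranteed to have at least one solution that is absolutely continuous. The reader is referred to Aubin & Cellina [1] for more details.

If $\mathbf{x}$ is an absolutely continuous map satisfying (2) then we say that $\mathbf{x} \in \Sigma$.

A set-valued semiflow $\Phi$ associated with (2) is defined on $[0, +\infty) \times \mathbb{R}^d$ as follows:

$\Phi_t(x) := \{\mathbf{x}(t) \mid \mathbf{x} \in \Sigma,\ \mathbf{x}(0) = x\}$. Let $B \times M \subset [0, +\infty) \times \mathbb{R}^d$, define

$\Phi_B(M) := \bigcup_{t \in B,\ x \in M} \Phi_t(x)$.

Let $M \subset \mathbb{R}^d$, the $\omega$-limit set be defined by $\omega_\Phi(M) := \bigcap_{t \ge 0} \overline{\Phi_{[t, +\infty)}(M)}$. Similarly the limit set of a solution $\mathbf{x}$ is given by $L(\mathbf{x}) = \bigcap_{t \ge 0} \overline{\mathbf{x}([t, +\infty))}$.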

Invariant Set: $M \subset \mathbb{R}^d$ is invariant if for every $x \in M$ there exists a trajectory, $\mathbf{x}$, entirely in $M$ with $\mathbf{x}(0) = x$. In other words, $\mathbf{x} \in \Sigma$ with $\mathbf{x}(t) \in M$, for all $t \ge 0$.

Internally Chain Transitive Set: $M \subset \mathbb{R}^d$ is said to be internally chain transitive if $M$ is compact and for every $x, y \in M$, $\epsilon > 0$ and $T > 0$ we have the following: There exist $\Phi^1, \ldots, \Phi^n$ that are $n$ solutions to the differential inclusion $\dot{x}(t) \in h(x(t))$, a sequence $x_1 (= x), \ldots, x_{n+1} (= y) \subset M$ and $n$ real numbers $t_1, t_2, \ldots, t_n$ greater than $T$ such that: $\Phi^i_{t_i}(x_i) \in N^\epsilon(x_{i+1})$ and $\Phi^i_{[0, t_i]}(x_i) \subset M$ for $1 \le i \le n$. The sequence $(x_1 (= x), \ldots, x_{n+1} (= y))$ is called an $(\epsilon, T)$ chain in $M$ from $x$ to $y$.

Given $x \in \mathbb{R}^d$ and $A \subset \mathbb{R}^d$, define the distance between $x$ and $A$ by $d(x, A) := \inf\{\|x - y\| \mid y \in A\}$. We define the $\delta$-open neighborhood of $A$ by $N^\delta(A) := \{x \mid d(x, A) < \delta\}$. The $\delta$-closed neighborhood of $A$ is defined by $\overline{N}^\delta(A) := \{x \mid d(x, A) \le \delta\}$.
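The distance and neighborhood definitions above can be sketched in code for the special case where $A$ is a finite set of points, so that the infimum becomes a minimum. The function names are hypothetical and chosen only for this example.

```python
import math


def dist_to_set(x, A):
    """d(x, A) = inf{||x - y|| : y in A}, for a finite, non-empty set A."""
    return min(math.dist(x, y) for y in A)


def in_open_nbhd(x, A, delta):
    """True if x lies in the delta-open neighborhood N^delta(A)."""
    return dist_to_set(x, A) < delta


def in_closed_nbhd(x, A, delta):
    """True if x lies in the delta-closed neighborhood of A."""
    return dist_to_set(x, A) <= delta
```

For a general compact set the infimum would have to be approximated, for example over a fine sample of the set.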

Attracting Set: $\mathcal{A} \subset \mathbb{R}^d$ is an attracting set if it is compact and there exists a neighborhood $U$ such that for any $\epsilon > 0$ there exists $T(\epsilon) > 0$ such that $\Phi_{[T(\epsilon), +\infty)}(U) \subset N^\epsilon(\mathcal{A})$. Then $U$ is called the fundamental neighborhood of $\mathcal{A}$. If, in addition to being compact, the attracting set is also invariant then it is called an attractor. The basin of attraction of $\mathcal{A}$ is given by $B(\mathcal{A}) = \{x \mid \omega_\Phi(x) \subset \mathcal{A}\}$. The set $\mathcal{A}$ is Lyapunov stable if for all $\delta > 0$, $\exists\, \epsilon > 0$ such that $\Phi_{[0, +\infty)}(N^\epsilon(\mathcal{A})) \subset N^\delta(\mathcal{A})$. We use $T(\epsilon)$ and $T_\epsilon$ interchangeably to denote the dependence of $T$ on $\epsilon$.

The open ball of radius $r$ around 0 is represented by $B_r(0)$, while the closed ball is represented by $\overline{B}_r(0)$.

Upper limit of sequences of sets: Let $\{K_n\}_{n \ge 1}$ be a sequence of sets in $\mathbb{R}^d$. The upper-limit of $\{K_n\}_{n \ge 1}$ is given by $\mathrm{Limsup}_{n \to \infty} K_n := \{y \mid \liminf_{n \to \infty} d(y, K_n) = 0\}$.

2.2 Assumptions

Let us consider a stochastic approximation algorithm with ‘controlled Markov’ noise in $\mathbb{R}^d$:

$x_{n+1} = x_n + a(n)\,[h(x_n, y_n) + M_{n+1}]$, where (3)

(i) $h : \mathbb{R}^d \times S \to \mathbb{R}^d$ is a jointly continuous map with $S$ a compact metric space. The map $h$ is Lipschitz continuous in the first component; further, its constant does not change with the second component. Let the Lipschitz constant be $L$. This is assumption (1) in Section 2 of Borkar [6]. Here we call it (A1).

(ii) The step-size sequence $\{a(n)\}_{n \ge 0}$ is such that $a(n) > 0$ for all $n \ge 0$, $\sum_{n \ge 0} a(n) = \infty$ and $\sum_{n \ge 0} a(n)^2 < \infty$. Without loss of generality let $\sup_{n \ge 0} a(n) \le 1$. This is assumption (3) in Section 2 of Borkar [6]. Here we call it (A3).

(iii) $\{M_n\}_{n \ge 1}$ is a sequence of square integrable martingale difference terms, that also contribute to the noise. They are related to $\{x_n\}_{n \ge 0}$ by

$E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] \le K(1 + \|x_n\|^2)$, where $n \ge 0$.

This is assumption (2) in Section 2 of Borkar [6]. Here we call it (A2).

(iv) $\{y_n\}_{n \ge 0}$ is the $S$-valued ‘Controlled Markov’ process.

Note that $S$ is assumed to be Polish in [6]. As stated in (A1), in this paper we let $S$ be a compact metric space, hence Polish. Among the assumptions made in [6], (A1)-(A3) are relevant to prove the stability of the iterates. The remaining assumptions are listed in Section 4, where we present the result on the ‘stability and convergence’ of the iterates given by (3). See Borkar [6] for more details.

For each $c \ge 1$, we define functions $h_c : \mathbb{R}^d \times S \to \mathbb{R}^d$ by $h_c(x, y) := h(cx, y)/c$.

We define the limiting map $h_\infty : \mathbb{R}^d \times S \to \{\text{subsets of } \mathbb{R}^d\}$ by $h_\infty(x, y) := \mathrm{Limsup}_{c \to \infty} \{h_c(x, y)\}$, where Limsup is the upper-limit of a sequence of sets (see Section 2.1).

For each $x \in \mathbb{R}^d$ define $H(x) := \overline{co}\left(\bigcup_{y \in S} h_\infty(x, y)\right)$.
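To make the scaling concrete, here is a small sketch for a hypothetical scalar drift $h(x, y) = -x + y \sin(x)$ with $S = [0, 1]$ (an assumption made for illustration, not an example from the paper). Then $h_c(x, y) = -x + y \sin(cx)/c$, so the perturbation vanishes as $c \to \infty$ and, for this example, $h_\infty(x, y) = \{-x\}$ and $H(x) = \{-x\}$.

```python
import math


def h(x, y):
    """Hypothetical drift; Lipschitz in x with constant 2 for y in [0, 1]."""
    return -x + y * math.sin(x)


def h_c(x, y, c):
    """Scaled map h_c(x, y) = h(c * x, y) / c, defined for c >= 1."""
    return h(c * x, y) / c
```

Note that `h_c` inherits the Lipschitz constant of `h` in its first argument, whatever the value of `c`.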


We replace the stability assumption ($\sup_{n \ge 0} \|x_n\| < \infty$) in [5] with the following two assumptions.

(S1) If $c_n \uparrow \infty$, $y_n \to y$ and $\lim_{n \to \infty} h_{c_n}(x, y_n) = u$ for some $u \in \mathbb{R}^d$, then $u \in h_\infty(x, y)$.

(S2) There exists an attracting set, $\mathcal{A}$, associated with $\dot{x}(t) \in H(x(t))$ such that $\sup_{u \in \mathcal{A}} \|u\| < 1$. Further, $\overline{B}_1(0)$ is a subset of some fundamental neighborhood of $\mathcal{A}$.

Assumption (T2), discussed in Section 5, is a sufficient condition for (S2) to be satisfied. One could say that (T2) constitutes the ‘Lyapunov function’ condition for the DI. We shall show that $H$ is a Marchaud map in Lemma 2. As explained in [1], it follows that the DI, $\dot{x}(t) \in H(x(t))$, has at least one solution that is absolutely continuous. Hence assumption (S2) is meaningful.


We begin by showing that $h_c$ satisfies (A1) for all $c \ge 1$. Fix $x_1, x_2 \in \mathbb{R}^d$, $y \in S$ and $c \ge 1$; we have

$\|h_c(x_1, y) - h_c(x_2, y)\| = \|h(cx_1, y)/c - h(cx_2, y)/c\|$,

$\|h(cx_1, y)/c - h(cx_2, y)/c\| \le L\|cx_1 - cx_2\|/c$, hence

$\|h_c(x_1, y) - h_c(x_2, y)\| \le L\|x_1 - x_2\|$.

We thus have that $h_c$ is Lipschitz continuous in the first component with Lipschitz constant $L$. Further, for a fixed $c$ this constant does not change with $y$. Since $c$ was arbitrarily chosen it follows that $L$ is the Lipschitz constant associated with every $h_c$. It is trivially true that $h_c$ is a jointly continuous map.


Fix $c \ge 1$, $x \in \mathbb{R}^d$ and $y \in S$, then

$\|h_c(x, y) - h_c(0, y)\| \le L\|x - 0\|$, hence

$\|h_c(x, y)\| \le \|h_c(0, y)\| + L\|x\|$.

Since $h(0, \cdot)$ is a continuous function on $S$ (a compact set) and $c \ge 1$ we have $\|h_c(0, \cdot)\|_\infty \le \|h(0, \cdot)\|_\infty \le M$ for some $0 < M < \infty$. Thus

$\|h_c(x, y)\| \le K(1 + \|x\|)$, where $K = L \vee M$.

We may assume without loss of generality that $K$ is such that $E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] \le K(1 + \|x_n\|^2)$ also holds for all $n \ge 0$ (assumption (A2)). Again $K$ does not change with $c$.
Fix $x \in \mathbb{R}^d$ and $y \in S$. As explained in the previous paragraph we have,

$\sup_{c \ge 1} \|h_c(x, y)\| \le K(1 + \|x\|)$.

The upper-limit of $\{h_c(x, y)\}_{c \ge 1}$, $\mathrm{Limsup}_{c \to \infty} \{h_c(x, y)\}$, is clearly non-empty. Recall that $h_\infty(x, y) = \mathrm{Limsup}_{c \to \infty} \{h_c(x, y)\}$ and $H(x) = \overline{co}\left(\bigcup_{y \in S} h_\infty(x, y)\right)$. Hence,

$\sup_{u \in h_\infty(x, y)} \|u\| \le K(1 + \|x\|)$ and

$\sup_{u \in H(x)} \|u\| \le K(1 + \|x\|)$. (4)

We need to show that $H$ is a Marchaud map. Before we do that, let us prove an auxiliary result.

Lemma 1. Suppose $x_n \to x$ in $\mathbb{R}^d$, $y_n \to y$ in $S$, $c_n \uparrow \infty$ and $\lim_{c_n \uparrow \infty} h_{c_n}(x_n, y_n) = u$. Then $u \in h_\infty(x, y)$.

Proof. Consider the following inequality:

$\|h_{c_n}(x, y_n) - u\| \le \|h_{c_n}(x_n, y_n) - u\| + \|h_{c_n}(x, y_n) - h_{c_n}(x_n, y_n)\|$.

Since $\|h_{c_n}(x, y_n) - h_{c_n}(x_n, y_n)\| \le L\|x_n - x\|$ and $\lim_{c_n \to \infty} h_{c_n}(x_n, y_n) = u$, we get

$\lim_{c_n \to \infty} h_{c_n}(x, y_n) = u$.

It follows from (S1) that $u \in h_\infty(x, y)$. □

The following is a direct consequence of Lemma 1: If $x_n \to x$ in $\mathbb{R}^d$, $\{y_n\} \subseteq S$ and $c_n \to \infty$ then $d(h_{c_n}(x_n, y_n), H(x)) \to 0$. If this is not so, then without loss of generality we have that $d(h_{c_n}(x_n, y_n), H(x)) > \epsilon$ for some $\epsilon > 0$. Since $S$ is compact, $\exists\, \{m(n)\} \subseteq \{n\}$ such that $\lim_{m(n) \to \infty} y_{m(n)} = y$ and $\lim_{m(n) \to \infty} h_{c_{m(n)}}(x_{m(n)}, y_{m(n)}) = u$ for some $y \in S$ and some $u \in \mathbb{R}^d$. We have $x_{m(n)} \to x$, $y_{m(n)} \to y$, $c_{m(n)} \to \infty$ and $h_{c_{m(n)}}(x_{m(n)}, y_{m(n)}) \to u$. It follows from Lemma 1 that $u \in h_\infty(x, y) \subseteq H(x)$. This is a contradiction.

Lemma 2. $H$ is a Marchaud map.

Proof. Recall that $H(x) = \overline{co}\left(\bigcup_{y \in S} h_\infty(x, y)\right)$. As explained earlier (cf. (4)),

$\sup_{u \in H(x)} \|u\| \le K(1 + \|x\|)$.

Hence $H$ is point-wise bounded. From the definition of $H$ it follows that $H(x)$ is convex and compact for each $x \in \mathbb{R}^d$.
It is left to show that $H$ is upper semi-continuous. Let $x_n \to x$, $u_n \to u$ and $u_n \in H(x_n)$, $n \ge 1$. We need to show that $u \in H(x)$. If this is not true, then there exists a linear functional on $\mathbb{R}^d$, say $f$, such that $\sup_{v \in H(x)} f(v) \le \alpha - \epsilon$ and $f(u) \ge \alpha + \epsilon$, for some $\alpha \in \mathbb{R}$ and $\epsilon > 0$. Since $u_n \to u$, there exists $N$ such that for each $n \ge N$, $f(u_n) \ge \alpha + \epsilon/2$, i.e., $H(x_n) \cap [f \ge \alpha + \epsilon/2] \ne \emptyset$; here $[f \ge a]$ is used to denote the set $\{x \mid f(x) \ge a\}$. For the sake of notational convenience let us denote $\bigcup_{y \in S} h_\infty(x, y)$ by $A(x)$ for all $x \in \mathbb{R}^d$. We claim that $A(x_n) \cap [f \ge \alpha + \epsilon/2] \ne \emptyset$ for all $n \ge N$. We shall prove this claim later; for now we assume that the claim is true and proceed.

Pick $w_n \in A(x_n) \cap [f \ge \alpha + \epsilon/2]$ for each $n \ge N$. Let $w_n \in h_\infty(x_n, y_n)$ for some $y_n \in S$. Since $\{w_n\}_{n \ge N}$ is norm bounded it contains a convergent subsequence, say $\{w_{n(k)}\}_{k \ge 1} \subseteq \{w_n\}_{n \ge N}$. Let $\lim_{k \to \infty} w_{n(k)} = w$. Since $w_{n(k)} \in h_\infty(x_{n(k)}, y_{n(k)})$, $\exists\, c_{n(k)} \in \mathbb{N}$ such that $\|w_{n(k)} - h_{c_{n(k)}}(x_{n(k)}, y_{n(k)})\| < 1/n(k)$. The sequence $\{c_{n(k)}\}_{k \ge 1}$ is chosen such that $c_{n(k+1)} > c_{n(k)}$ for each $k \ge 1$. Since $\{y_{n(k)}\}_{k \ge 1}$ is from a compact set, there exists a convergent subsequence. For the sake of notational convenience (without loss of generality) we assume that the sequence itself has a limit, i.e., $y_{n(k)} \to y$ for some $y \in S$. We have the following: $c_{n(k)} \uparrow \infty$, $x_{n(k)} \to x$, $y_{n(k)} \to y$, $w_{n(k)} \to w$ and $\|w_{n(k)} - h_{c_{n(k)}}(x_{n(k)}, y_{n(k)})\| \to 0$, so that $h_{c_{n(k)}}(x_{n(k)}, y_{n(k)}) \to w$. It follows from Lemma 1 that $w \in h_\infty(x, y)$. Since $w_{n(k)} \to w$ and $f(w_{n(k)}) \ge \alpha + \epsilon/2$ for each $k \ge 1$, we have that $f(w) \ge \alpha + \epsilon/2$. This contradicts $\sup_{w \in H(x)} f(w) \le \alpha - \epsilon$.

It remains to prove that $A(x_n) \cap [f \ge \alpha + \epsilon/2] \ne \emptyset$ for all $n \ge N$. If this were not true, then $\exists\, \{m(k)\}_{k \ge 1} \subseteq \{n \ge N\}$ such that $A(x_{m(k)}) \subseteq [f < \alpha + \epsilon/2]$ for all $k$. It follows that $H(x_{m(k)}) = \overline{co}(A(x_{m(k)})) \subseteq [f \le \alpha + \epsilon/2]$ for each $k \ge 1$. Since $u_{m(k)} \to u$, $\exists\, N_1$ such that for all $m(k) \ge N_1$, $f(u_{m(k)}) \ge \alpha + 3\epsilon/4$. This leads to a contradiction. □


3 Stability Theorem

Let us construct the linear interpolated trajectory $\bar{x}(t)$ for $t \in [0, \infty)$ from the sequence $\{x_n\}_{n \ge 0}$. Define $t(0) := 0$ and $t(n) := \sum_{i=0}^{n-1} a(i)$, $n \ge 1$. Let $\bar{x}(t(n)) := x_n$ and for $t \in (t(n), t(n+1))$ let

$\bar{x}(t) := \frac{t(n+1) - t}{t(n+1) - t(n)}\,\bar{x}(t(n)) + \frac{t - t(n)}{t(n+1) - t(n)}\,\bar{x}(t(n+1))$.

Define $T_0 := 0$ and $T_n := \min\{t(m) : t(m) \ge T_{n-1} + T\}$ for $n \ge 1$. Observe that there exists a subsequence, $\{m(n)\}$, of $\mathbb{N}$ such that $T_n = t(m(n))$ for all $n \ge 0$.
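The time grid $t(n)$ and the checkpoints $T_n = t(m(n))$ can be computed directly from the step sizes. The following is a minimal sketch; the function names are hypothetical and the step-size sequence in the usage note is only an example.

```python
from itertools import accumulate


def time_grid(steps):
    """Return [t(0), t(1), ...] with t(0) = 0 and t(n) = a(0) + ... + a(n-1)."""
    return [0.0] + list(accumulate(steps))


def checkpoints(t, T):
    """Return the indices m(n) such that T_n = t(m(n)).

    T_0 = 0 and T_n = min{t(m) : t(m) >= T_{n-1} + T}. Because t is
    increasing, a single left-to-right scan finds each minimum.
    """
    idx = [0]
    for m in range(1, len(t)):
        if t[m] >= t[idx[-1]] + T:
            idx.append(m)
    return idx
```

For instance, with $a(n) = 1/(n+1)$ and $T = 2$ the first checkpoint is $T_1 = t(4) = 1 + 1/2 + 1/3 + 1/4$. Since $\sup_n a(n) \le 1$, consecutive checkpoints are separated by at least $T$ and at most $T + 1$.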


We use $\bar{x}(\cdot)$ to construct the rescaled trajectory, $\hat{x}(t)$, for $t \ge 0$. Let $t \in [T_n, T_{n+1})$ for some $n \ge 0$ and define $\hat{x}(t) := \bar{x}(t)/r(n)$, where $r(n) = \|\bar{x}(T_n)\| \vee 1$. Also, let $\hat{x}(T_{n+1}^-) := \lim_{t \uparrow T_{n+1}} \hat{x}(t)$, $t \in [T_n, T_{n+1})$. The rescaled martingale difference terms are given by $\hat{M}_{k+1} := M_{k+1}/r(n)$, $t(k) \in [T_n, T_{n+1})$.
We define a piece-wise constant trajectory, $\hat{z}(\cdot)$, using the rescaled trajectory as follows: Let $t \in [t(m), t(m+1))$ and $T_n \le t(m) < t(m+1) \le T_{n+1}$. Define $\hat{z}(t) := h_{r(n)}(\hat{x}(t(m)), y_m)$. Let us define another piece-wise constant trajectory $y(\cdot)$ using $\{y_n\}_{n \ge 0}$ as follows: Let $y(t) := y_n$ for all $t \in [t(n), t(n+1))$.

Recall that $\mathcal{A}$ is an attracting set associated with $\dot{x}(t) \in H(x(t))$ (see assumption (S2) in Section 2.2). Let $\delta_1 := \sup_{u \in \mathcal{A}} \|u\|$, then $\delta_1 < 1$. Choose $\delta_2$, $\delta_3$, and $\delta_4$ such that $\delta_1 < \delta_2 < \delta_3 < \delta_4 < 1$. Fix $T := T(\delta_2 - \delta_1)$, where $T(\cdot)$ is defined in Section 2.1. Let $x(\cdot)$ be a solution to $\dot{x}(t) \in H(x(t))$ such that $\|x(0)\| \le 1$, then $\|x(t)\| < \delta_2$ for all $t \ge T(\delta_2 - \delta_1)$.


Consider the following recursion:

$\bar{x}(t(k+1)) = \bar{x}(t(k)) + a(k)\,(h(\bar{x}(t(k)), y_k) + M_{k+1})$,

such that $t(k), t(k+1) \in [T_n, T_{n+1})$. Multiplying both sides by $1/r(n)$, we get the following rescaled recursion:

$\hat{x}(t(k+1)) = \hat{x}(t(k)) + a(k)\,(h_{r(n)}(\hat{x}(t(k)), y_k) + \hat{M}_{k+1})$. (5)

Note that $E[\|\hat{M}_{k+1}\|^2 \mid \mathcal{F}_k] \le K(1 + \|\hat{x}(t(k))\|^2)$.
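The passage from the original recursion to the rescaled recursion (5) is a one-line identity: dividing one step by $r$ gives the same result as stepping the rescaled iterate with $h_r$ and the rescaled noise. The sketch below checks this numerically for a hypothetical scalar drift $h(x, y) = -2x + y$, which is an assumption made for the example only.

```python
def h(x, y):
    """Hypothetical scalar drift, used only for illustration."""
    return -2.0 * x + y


def h_c(x, y, c):
    """Scaled map h_c(x, y) = h(c * x, y) / c."""
    return h(c * x, y) / c


def original_step(x, y, a, noise):
    """One step of x_{k+1} = x_k + a(k) * (h(x_k, y_k) + M_{k+1})."""
    return x + a * (h(x, y) + noise)


def rescaled_step(x_hat, y, a, noise_hat, r):
    """One step of recursion (5) with scale r = r(n)."""
    return x_hat + a * (h_c(x_hat, y, r) + noise_hat)
```

The identity holds because $h_r(x/r, y) = h(x, y)/r$ by the definition of $h_r$.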


The following two lemmas can be found in Borkar & Meyn [5] (that however 
does not consider ‘controlled Markov’ noise). It is shown there that the ‘mar¬ 
tingale noise’ sequence converges almost surely. We present the results below 
using our setting. 

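Lemma 3. $\sup_{t \ge 0} E\|\hat{x}(t)\|^2 < \infty$.

Proof. Recall that $T_n = t(m(n))$ and $T_{n+1} = t(m(n+1))$. It is enough to show that

$\sup_{m(n) < k < m(n+1)} E[\|\hat{x}(t(k))\|^2] \le M$,

for some $M (> 0)$ that is independent of $n$. Let us fix $n$ and $k$ such that $n \ge 0$ and $m(n) < k < m(n+1)$. Consider the following rescaled recursion:

$\hat{x}(t(k)) = \hat{x}(t(k-1)) + a(k-1)\,(\hat{z}(t(k-1)) + \hat{M}_k)$.

Unfolding the above we get,

$\hat{x}(t(k)) = \hat{x}(t(m(n))) + \sum_{l=m(n)}^{k-1} a(l)\,(\hat{z}(t(l)) + \hat{M}_{l+1})$.

Taking expectation of the square of the norms on both sides we get,

$E\|\hat{x}(t(k))\|^2 = E\left\|\hat{x}(t(m(n))) + \sum_{l=m(n)}^{k-1} a(l)\,(\hat{z}(t(l)) + \hat{M}_{l+1})\right\|^2$.

It follows from the Minkowski inequality that,

$E^{1/2}\|\hat{x}(t(k))\|^2 \le E^{1/2}\|\hat{x}(T_n)\|^2 + \sum_{l=m(n)}^{k-1} a(l)\left(E^{1/2}\|\hat{z}(t(l))\|^2 + E^{1/2}\|\hat{M}_{l+1}\|^2\right)$.

For each $l$ such that $m(n) \le l \le k-1$, $\|\hat{z}(t(l))\| = \|h_{r(n)}(\hat{x}(t(l)), y(t(l)))\| \le K(1 + \|\hat{x}(t(l))\|)$. Further, $E\|\hat{M}_{l+1}\|^2 \le K(1 + E\|\hat{x}(t(l))\|^2)$. Observe that $T_{n+1} - T_n \le T + 1$ (since $\sup_n a(n) \le 1$). Using these observations we get the following:

$E^{1/2}\|\hat{x}(t(k))\|^2 \le 1 + \sum_{l=m(n)}^{k-1} a(l)\left(K\,E^{1/2}(1 + \|\hat{x}(t(l))\|)^2 + \sqrt{K}\,E^{1/2}(1 + \|\hat{x}(t(l))\|^2)\right)$,

$E^{1/2}\|\hat{x}(t(k))\|^2 \le 1 + \sum_{l=m(n)}^{k-1} a(l)\left(K(1 + E^{1/2}\|\hat{x}(t(l))\|^2) + \sqrt{K}(1 + E^{1/2}\|\hat{x}(t(l))\|^2)\right)$,

$E^{1/2}\|\hat{x}(t(k))\|^2 \le \left[1 + (K + \sqrt{K})(T + 1)\right] + (K + \sqrt{K}) \sum_{l=m(n)}^{k-1} a(l)\,E^{1/2}\|\hat{x}(t(l))\|^2$.

Applying the discrete version of Gronwall inequality we now get,

$E^{1/2}\|\hat{x}(t(k))\|^2 \le \left[1 + (K + \sqrt{K})(T + 1)\right] e^{(K + \sqrt{K})(T + 1)}$.

Let us define $M := \left(\left[1 + (K + \sqrt{K})(T + 1)\right] e^{(K + \sqrt{K})(T + 1)}\right)^2$. Clearly $M$ is independent of $n$ and the claim follows. □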
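The discrete Gronwall step used above says that if $u_k \le C + L \sum_{l<k} a(l)\,u_l$ with non-negative terms, then $u_k \le C \exp(L \sum_{l<k} a(l))$. The sketch below checks this numerically in the extremal case where the hypothesis holds with equality; the constants and step sizes are arbitrary illustrative values, not quantities from the paper.

```python
import math


def gronwall_check(steps, C, L):
    """Build u_k = C + L * sum_{l<k} a(l) * u_l and the Gronwall bounds.

    Returns two lists: the sequence u and the bounds
    C * exp(L * sum_{l<k} a(l)), index by index.
    """
    u, bounds = [], []
    weighted_sum = 0.0  # running value of sum_{l<k} a(l) * u_l
    total_step = 0.0    # running value of sum_{l<k} a(l)
    for a in steps:
        u_k = C + L * weighted_sum
        u.append(u_k)
        bounds.append(C * math.exp(L * total_step))
        weighted_sum += a * u_k
        total_step += a
    return u, bounds
```

In the equality case $u_k = C \prod_{l<k} (1 + L\,a(l))$, and the bound follows from $1 + s \le e^s$.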

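Lemma 4. The sequence $\hat{\zeta}_n$, $n \ge 0$, converges almost surely, where $\hat{\zeta}_n := \sum_{k=0}^{n-1} a(k)\,\hat{M}_{k+1}$ for all $n \ge 1$.

Proof. It is enough to prove that

$\sum_{k=0}^{\infty} E\left[\|a(k)\,\hat{M}_{k+1}\|^2 \mid \mathcal{F}_k\right] < \infty$ a.s.

Instead, we prove that

$E\left[\sum_{k=0}^{\infty} a(k)^2\,E\left[\|\hat{M}_{k+1}\|^2 \mid \mathcal{F}_k\right]\right] < \infty$.

From assumption (A2) we get

$E\left[\sum_{k=0}^{\infty} a(k)^2\,E\left[\|\hat{M}_{k+1}\|^2 \mid \mathcal{F}_k\right]\right] \le \sum_{k=0}^{\infty} a(k)^2\,K\left(1 + E\|\hat{x}(t(k))\|^2\right)$.

The claim now follows from Lemma 3 and (A3). □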
Let $x^n(t)$, $t \in [0, T]$, be the solution (up to time $T$) to $\dot{x}^n(t) = \hat{z}(T_n + t)$ with initial condition $x^n(0) = \hat{x}(T_n)$. Clearly,

$x^n(t) = \hat{x}(T_n) + \int_0^t \hat{z}(T_n + s)\,ds$. (6)

Lemma 5. $\lim_{n \to \infty} \sup_{t \in [T_n, T_n + T]} \|x^n(t) - \hat{x}(t)\| = 0$ a.s.

Proof. Let $t \in [t(m(n)+k), t(m(n)+k+1))$ such that $T_n \le t(m(n)+k) < t(m(n)+k+1) \le T_{n+1}$, where $n \ge 0$. First we prove the lemma when $t(m(n)+k+1) < T_{n+1}$. Consider the following:

$\hat{x}(t) = \frac{t(m(n)+k+1) - t}{a(m(n)+k)}\,\hat{x}(t(m(n)+k)) + \frac{t - t(m(n)+k)}{a(m(n)+k)}\,\hat{x}(t(m(n)+k+1))$.

Substituting for $\hat{x}(t(m(n)+k+1))$ in the above equation we get:

$\hat{x}(t) = \frac{t(m(n)+k+1) - t}{a(m(n)+k)}\,\hat{x}(t(m(n)+k)) + \frac{t - t(m(n)+k)}{a(m(n)+k)} \left(\hat{x}(t(m(n)+k)) + a(m(n)+k)\left(h_{r(n)}(\hat{x}(t(m(n)+k)), y_{m(n)+k}) + \hat{M}_{m(n)+k+1}\right)\right)$,

hence,

$\hat{x}(t) = \hat{x}(t(m(n)+k)) + (t - t(m(n)+k))\left(h_{r(n)}(\hat{x}(t(m(n)+k)), y_{m(n)+k}) + \hat{M}_{m(n)+k+1}\right)$.

Unfolding $\hat{x}(t(m(n)+k))$, we get (see (5)),

$\hat{x}(t) = \hat{x}(T_n) + \sum_{l=0}^{k-1} a(m(n)+l)\left(h_{r(n)}(\hat{x}(t(m(n)+l)), y_{m(n)+l}) + \hat{M}_{m(n)+l+1}\right) + (t - t(m(n)+k))\left(h_{r(n)}(\hat{x}(t(m(n)+k)), y_{m(n)+k}) + \hat{M}_{m(n)+k+1}\right)$. (7)

Recall that

$x^n(t) = \hat{x}(T_n) + \int_0^t \hat{z}(T_n + s)\,ds$.

Splitting the above integral, we get

$x^n(t) = \hat{x}(T_n) + \sum_{l=0}^{k-1} \int_{t(m(n)+l)}^{t(m(n)+l+1)} \hat{z}(s)\,ds + \int_{t(m(n)+k)}^{t} \hat{z}(s)\,ds$.

Thus,

$x^n(t) = \hat{x}(T_n) + \sum_{l=0}^{k-1} a(m(n)+l)\,h_{r(n)}(\hat{x}(t(m(n)+l)), y_{m(n)+l}) + (t - t(m(n)+k))\,h_{r(n)}(\hat{x}(t(m(n)+k)), y_{m(n)+k})$. (8)

From (7) and (8), we get the following:

$\|x^n(t) - \hat{x}(t)\| \le \left\|\sum_{l=0}^{k-1} a(m(n)+l)\,\hat{M}_{m(n)+l+1}\right\| + \left\|(t - t(m(n)+k))\,\hat{M}_{m(n)+k+1}\right\|$,

$\|x^n(t) - \hat{x}(t)\| \le \|\hat{\zeta}_{m(n)+k} - \hat{\zeta}_{m(n)}\| + \|\hat{\zeta}_{m(n)+k+1} - \hat{\zeta}_{m(n)+k}\|$.

If $t(m(n)+k+1) = T_{n+1}$ then in the above set of equations we may replace $\hat{x}(t(m(n)+k+1))$ with $\hat{x}(T_{n+1}^-)$. The arguments remain the same. Since $\hat{\zeta}_n$, $n \ge 1$, converges almost surely, the lemma follows. □

Recall that $T = T(\delta_2 - \delta_1)$. Let us view $\{x^n([0, T]) \mid n \ge 0\}$ and $\{\hat{x}([T_n, T_n + T]) \mid n \ge 0\}$ as subsets of $C([0, T], \mathbb{R}^d)$ (endowed with the sup-norm, $\|\cdot\|_\infty$). We claim that $\{x^n([0, T]) \mid n \ge 0\}$ is equicontinuous and point-wise bounded almost surely. Since $\|x^n(0)\| = \|\hat{x}(T_n)\| \le 1$, we can use Gronwall inequality to show that $\sup_{n \ge 0} \|x^n(\cdot)\|_\infty < \infty$ almost surely. Note that $\|x^n(\cdot)\|_\infty = \sup_{t \in [0, T]} \|x^n(t)\|$. Hence we conclude that the aforementioned set is almost surely point-wise bounded.

Now we show that the family of functions is almost surely equicontinuous. Recall that $\sup_{t \ge 0} E\|\hat{x}(t)\|^2 < \infty$ a.s. and $\|\hat{z}(t)\| \le K(1 + \|\hat{x}([t])\|)$, where $[t] := \max\{t(m) \mid t(m) \le t\}$. Hence $\sup_{t \ge 0} \|\hat{z}(t)\| < \infty$ a.s. For $\delta > 0$, we have the following:

$\|x^n(t + \delta) - x^n(t)\| \le \int_t^{t+\delta} \|\hat{z}(T_n + s)\|\,ds$.

Since $\sup_{t \ge 0} \|\hat{z}(t)\| < \infty$ a.s. it follows that

$\|x^n(t + \delta) - x^n(t)\| \le \int_t^{t+\delta} M\,ds = M\delta$, where

$M$ is a constant (possibly sample path dependent) such that $\sup_{t \ge 0} \|\hat{z}(t)\| \le M$.

Hence we conclude that $\{x^n([0, T]) \mid n \ge 0\}$ is equicontinuous. It follows from the Arzela-Ascoli Theorem that $\{x^n([0, T]) \mid n \ge 0\}$ is relatively compact in $C([0, T], \mathbb{R}^d)$. From Lemma 5 it follows that $\{\hat{x}([T_n, T_n + T]) \mid n \ge 0\}$ is also relatively compact in $C([0, T], \mathbb{R}^d)$.

Using Gronwall’s inequality we can show that $\sup_{k \ge 0} \|x_k\| < \infty$ a.s. if and only if $\sup_{n \ge 0} \|\bar{x}(T_n)\| < \infty$ a.s. To prove the stability of the iterates it is enough to show that $\sup_{n \ge 0} r(n) < \infty$ a.s. given that the recursion satisfies (A1)-(A3), (S1) and (S2) (see Section 2.2). If $\sup_{n \ge 0} r(n) = \infty$ then there exists $\{l\} \subseteq \{n\}$ such that $r(l) \uparrow \infty$. In the lemma that follows we characterize the limit set of $\{\hat{x}([T_l, T_l + T]) \mid \{l\} \subseteq \{n\}\ \&\ r(l) \uparrow \infty\}$ in $C([0, T], \mathbb{R}^d)$.
Lemma 6. Let $\{l\} \subseteq \{n\}$ such that $r(l) \uparrow \infty$. Any limit of $\{\hat{x}([T_l, T_l + T]) \mid \{l\} \subseteq \{n\}\ \&\ r(l) \uparrow \infty\}$ in $C([0, T], \mathbb{R}^d)$ is of the form $x(t) = x(0) + \int_0^t z(s)\,ds$, where $x(0) \in \overline{B}_1(0)$ and $z : [0, T] \to \mathbb{R}^d$ is a measurable function such that $z(t) \in H(x(t))$, $t \in [0, T]$.

Proof. For $t \ge 0$ define $[t] := \max\{t(m) \mid t(m) \le t\}$. Fix $t_0 \in [T_n, T_{n+1})$, then $\hat{z}(t_0) = h_{r(n)}(\hat{x}([t_0]), y([t_0]))$. Since $\|h_{r(n)}(\hat{x}([t_0]), y([t_0]))\| \le K(1 + \|\hat{x}([t_0])\|)$, we have $\|\hat{z}(t_0)\| \le K(1 + \|\hat{x}([t_0])\|)$. It follows from Lemma 3 that $\sup_{t \ge 0} \|\hat{z}(t)\| < \infty$ a.s. Recall that $\{\hat{x}(T_l + \cdot) \mid \{l\} \subseteq \{n\}\}$ is relatively compact in $C([0, T], \mathbb{R}^d)$. Without loss of generality we may assume that

$\hat{x}(T_l + \cdot) \to x(\cdot)$ in $C([0, T], \mathbb{R}^d)$, for some $x(\cdot) \in C([0, T], \mathbb{R}^d)$;

$\hat{z}(T_l + \cdot) \to z(\cdot)$ weakly in $L_2([0, T], \mathbb{R}^d)$, for some $z(\cdot) \in L_2([0, T], \mathbb{R}^d)$.

It follows from Lemma 5 that $x^l(\cdot) \to x(\cdot)$ in $C([0, T], \mathbb{R}^d)$. Letting $r(l) \uparrow \infty$ in the following equation,

$x^l(t) = x^l(0) + \int_0^t \hat{z}(T_l + s)\,ds$, we get

$x(t) = x(0) + \int_0^t z(s)\,ds$.

Since $\|x^l(0)\| = \|\hat{x}(T_l)\| \le 1$, we have that $\|x(0)\| \le 1$. Further, since $\hat{z}(T_l + \cdot) \to z(\cdot)$ weakly in $L_2([0, T], \mathbb{R}^d)$ it follows from the Banach-Saks Theorem that

$\exists\, \{k(l)\} \subseteq \{l\}$ such that $\frac{1}{N} \sum_{l=1}^{N} \hat{z}(T_{k(l)} + \cdot) \to z(\cdot)$ strongly in $L_2([0, T], \mathbb{R}^d)$.

Further,

$\exists\, \{m(N)\} \subseteq \{N\}$ such that $\frac{1}{m(N)} \sum_{l=1}^{m(N)} \hat{z}(T_{k(l)} + \cdot) \to z(\cdot)$ a.e. on $[0, T]$. (9)

Fix $t_0 \in [0, T]$ such that (9) holds, i.e.,

$\lim_{m(N) \to \infty} \frac{1}{m(N)} \sum_{l=1}^{m(N)} \hat{z}(T_{k(l)} + t_0) = z(t_0)$. (10)

We know that $\hat{z}(T_{k(l)} + t_0) = h_{r(k(l))}(\hat{x}([T_{k(l)} + t_0]), y([T_{k(l)} + t_0]))$. Note that $y([T_{k(l)} + t_0]) = y(T_{k(l)} + t_0)$.

We claim the following: For any $\epsilon > 0$ there exists $N$ such that for all $n \ge N$, $\|\hat{x}(t(m)) - \hat{x}(t(m+1))\| < \epsilon$, where $T_n \le t(m) < t(m+1) < T_{n+1}$. If $t(m+1) = T_{n+1}$ then we claim that $\|\hat{x}(t(m)) - \hat{x}(T_{n+1}^-)\| < \epsilon$. We shall prove this later; for now we assume it to be true and proceed.
Since $\hat{x}(T_{k(l)} + t_0) \to x(t_0)$ it follows from the above claim that $\hat{x}([T_{k(l)} + t_0]) \to x(t_0)$. Since $r(k(l)) \uparrow \infty$ it follows from Lemma 1 that

$\lim_{r(k(l)) \uparrow \infty} d\left(h_{r(k(l))}(\hat{x}([T_{k(l)} + t_0]), y([T_{k(l)} + t_0])),\ H(x(t_0))\right) = 0$, i.e.,

$\lim_{r(k(l)) \uparrow \infty} d\left(\hat{z}(T_{k(l)} + t_0),\ H(x(t_0))\right) = 0$.

Further, since $H(x(t_0))$ is convex and compact, it follows from equation (10) that $z(t_0) \in H(x(t_0))$. On the measure zero set of $[0, T]$ where (9) does not hold, the value of $z(\cdot)$ can be modified to ensure that $z(t) \in H(x(t))$ for all $t \in [0, T]$.

It is left to prove the claim that was made earlier. We first show that given any $\epsilon > 0$ there exists $N$ such that $n \ge N$ implies that $\|\hat{x}(t(m)) - \hat{x}(t(m+1))\| < \epsilon$, where $T_n \le t(m) < t(m+1) < T_{n+1}$. We know that

$\hat{x}(t(m+1)) = \hat{x}(t(m)) + a(m)\left(h_{r(n)}(\hat{x}(t(m)), y(t(m))) + \hat{M}_{m+1}\right)$.

Hence,

$\|\hat{x}(t(m)) - \hat{x}(t(m+1))\| \le a(m)\,\|h_{r(n)}(\hat{x}(t(m)), y(t(m)))\| + \|\hat{\zeta}_{m+1} - \hat{\zeta}_m\|$.

Since $\|h_{r(n)}(x, y)\| \le K(1 + \|x\|)$, the above inequality becomes

$\|\hat{x}(t(m)) - \hat{x}(t(m+1))\| \le a(m)\,K(1 + \|\hat{x}(t(m))\|) + \|\hat{\zeta}_{m+1} - \hat{\zeta}_m\|$.

It follows from Lemmas 3 & 4 that $a(m)\,K(1 + \|\hat{x}(t(m))\|) \to 0$ and $\|\hat{\zeta}_{m+1} - \hat{\zeta}_m\| \to 0$ respectively in the ‘almost sure’ sense. In other words, there exists $N$ (possibly sample path dependent) such that the claim holds. The second part of the unproven claim considers the situation when $t(m+1) = T_{n+1}$, the proof of which follows in a similar manner. □

Theorem 1 (The Stability Theorem). Under assumptions (A1)-(A3), (S1) & (S2), $\sup_{n \ge 0} \|x_n\| < \infty$ a.s.

Proof. Define $B := \{\sup_{t \ge 0} \|\hat{x}(t)\| < \infty\} \cap \{\hat{\zeta}_n \text{ converges}\}$. It is enough to show that $\sup_{n \ge 0} \|x_n\| < \infty$ on $B$. Let us assume the contrary, i.e., $\sup_{n \ge 0} \|x_n\| = \infty$ on $\mathcal{P} \subseteq B$ such that $P(\mathcal{P}) > 0$. Fix $\omega \in \mathcal{P} \cap B$ and choose $N$ such that the following hold:

1. For all $m(l) \ge N$, $\sup_{t \in [0, T]} \|\hat{x}(T_l + t) - x^l(t)\| < \delta_3 - \delta_2$. This is possible since $\|\hat{x}(T_l + \cdot) - x^l(\cdot)\| \to 0$ on $B$ (Lemma 5). Recall that $\{m(n)\} \subseteq \mathbb{N}$ is such that $t(m(n)) = T_n$ for all $n \ge 0$.

2. For all $m(l) \ge N$, $\|\hat{x}(T_{l+1}^-)\| < \delta_4$. This is possible since $\|\hat{x}(T_l + T) - \hat{x}(T_{l+1}^-)\| \to 0$ as $l \to \infty$ and $\|\hat{x}(T_l + T)\| < \delta_3$ for large $m(l)$.

3. For all $m(l) \ge N$, $r(l) > 1$.

We have,

$\frac{\|\bar{x}(T_{l+1})\| / r(l)}{\|\bar{x}(T_l)\| / r(l)} = \frac{\|\hat{x}(T_{l+1}^-)\|}{\|\hat{x}(T_l)\|}$. (11)
For $m(l) \ge N$, we have $\|\hat{x}(T_{l+1}^-)\| < \delta_4$ and $\|\hat{x}(T_l)\| = 1$. Hence it follows from (11) that

$\frac{\|\bar{x}(T_{l+1})\|}{\|\bar{x}(T_l)\|} < \delta_4\ (< 1)$. (12)

Let us closely analyze the implication of (12). Since $\|\bar{x}(T_l)\| > 1$, it follows that $\|\bar{x}(T_{l+1})\| < \delta_4 \|\bar{x}(T_l)\|$; further, if $\|\bar{x}(T_{l+1})\| > 1$, we have

$\|\bar{x}(T_{l+2})\| < \delta_4 \|\bar{x}(T_{l+1})\| < \delta_4^2 \|\bar{x}(T_l)\|$.

We see that the trajectory has a tendency to fall into the unit ball at an exponential rate (from the outside). Let $t_0 \le T_l$ be the last time that the trajectory ‘jumps’ from inside the unit ball to the outside. This is because $t_0$ is the last time before $T_l$ when the trajectory ‘jumps’ outside and $\|\bar{x}(T_l)\| > 1$. It follows from the observation that the trajectory falls exponentially into the unit ball that this jump is at least $\|\bar{x}(T_l)\| - 1$. Since $r(l) \uparrow \infty$ the trajectory is forced to make larger and larger ‘last jumps’ from inside the unit ball to the outside such that the lengths of these jumps ($\ge r(l) - 1$) ‘run off’ to infinity. Further, these jumps are made within time $T + 1$. Using Gronwall’s inequality we however get a contradiction. □


4 Convergence Theorem

We begin this section by presenting the additional assumptions imposed on recursion (3). These assumptions are coupled with those made in Section 2.2 to prove that the iterates are stable and converge to an internally chain transitive invariant set associated with a DI that is defined in terms of the ergodic occupation measures associated with the ‘Markov process’.

The additional assumptions made are similar to those in Borkar [6]. We list them below.

(B1) $\{y_n\}_{n \ge 0}$ is an $S$-valued Markov process with two associated control processes: $\{x_n\}_{n \ge 0}$ and another random process $\{z_n\}_{n \ge 0}$ taking values in a compact metric space $U$. Thus

$P(y_{n+1} \in A \mid y_m, z_m, x_m,\ m \le n) = \int_A p(dy \mid y_n, z_n, x_n)$, $n \ge 0$,

for $A$ Borel in $S$. The map

$(y, z, x) \in S \times U \times \mathbb{R}^d \mapsto p(dw \mid y, z, x) \in \mathcal{P}(S)$

is continuous; further, it is uniformly continuous on compacts in the $x$ variable with respect to the other variables. $\mathcal{P}(S)$ is used to denote the space of probability measures on $S$.

Let $\varphi : S \to \mathcal{P}(U)$, with $\varphi(y) = \varphi(y, dz) \in \mathcal{P}(U)$, be a measurable map. Suppose the Markov process has a (possibly non-unique) invariant probability measure $\eta_{x,\varphi}(dy) \in \mathcal{P}(S)$; we can then define the corresponding ergodic occupation measure

$\Psi_{x,\varphi}(dy, dz) := \eta_{x,\varphi}(dy)\,\varphi(y, dz) \in \mathcal{P}(S \times U)$. (13)
Let $D(x)$ be the set of all such ergodic occupation measures for a prescribed $x$. It can be shown that $D(x)$ is closed and convex for each $x \in \mathbb{R}^d$. Further, the map $x \mapsto D(x)$ is upper-semicontinuous. For a proof of the aforementioned results the reader is referred to Chapter 6.2 of [7].

(B2) $D(x)$ is compact.

Let us define a $\mathcal{P}(S \times U)$-valued random process $\mu(t) = \mu(t, dy\,dz)$, $t \ge 0$, by $\mu(t) := \delta_{(y_n, z_n)}$, $t \in [t(n), t(n+1))$, for $n \ge 0$. For $t > s \ge 0$, define $\mu_s^t \in \mathcal{P}(S \times U \times [s, t])$ by $\mu_s^t(A \times B) := \frac{1}{t - s} \int_B \mu(y, A)\,dy$ for $A$, $B$ Borel in $S \times U$, $[s, t]$ respectively.

(B3) Almost surely, for $t > 0$, the set $\{\mu_s^{s+t},\ s \ge 0\}$ remains tight.

Define $\tilde{h}(x, \nu) := \int h(x, y)\,\nu(dy, U)$ for $\nu \in \mathcal{P}(S \times U)$. We use this to define the following DI.

$\dot{x}(t) \in \hat{h}(x(t))$, where $\hat{h}(x) := \{\tilde{h}(x, \nu) \mid \nu \in D(x)\}$. (14)

Theorem 2 (Stability & Convergence). Under assumptions (A1)-(A3), (S1), (S2) and (B1)-(B3), almost surely the iterates given by (3) are stable and converge to an internally chain transitive invariant set associated with $\dot{x}(t) \in \hat{h}(x(t))$.

Proof. Under assumptions (A1)-(A3), (S1) and (S2) the stability of the iterates follows from Theorem 1. Now, we invoke Theorem 3.1 of Borkar [5] to conclude that the iterates converge to an internally chain transitive invariant set associated with $\dot{x}(t) \in \hat{h}(x(t))$. □

5 Application to temporal difference learning

Temporal difference (TD) learning is an important prediction method which combines ideas from Monte Carlo and dynamic programming. It has been mostly used to solve problems from reinforcement learning. There are several variants of TD algorithms.

Consider the general form of a TD algorithm with ‘controlled Markov noise’.

$x_{n+1} = x_n + a(n)\,(h(x_n, y_n) + M_{n+1})$, where (15)

(i) $h : \mathbb{R}^d \times S \to \mathbb{R}^d$ is of the form $h(x, y) = A(y)x + b(y)$. Here $A : S \to \mathbb{R}^{d \times d}$ is a matrix valued function and $b : S \to \mathbb{R}^d$ is a vector valued function.

(ii) $\{a(n)\}_{n \ge 0}$ is the given step-size sequence such that $\sum_{n \ge 0} a(n) = \infty$ and $\sum_{n \ge 0} a(n)^2 < \infty$. Recall that this is assumption (A3).

(iii) $\{M_n\}_{n \ge 1}$ is the sequence of square integrable martingale difference terms such that

$E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] \le K(1 + \|x_n\|^2)$, $n \ge 0$.

Recall that this is assumption (A2).

(iv) $\{y_n\}_{n \ge 0}$ is an $S$-valued ‘Controlled Markov Process’. We assume that $S$ is a compact metric space.
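For intuition, recursion (15) can be simulated in a toy scalar setting. The sketch below assumes, purely for illustration, a two-state uncontrolled Markov chain $y_n \in \{0, 1\}$ that flips with probability 0.3, $A(y) = -1 - y$, $b(y) = y$, $a(n) = 1/(n+1)$ and small Gaussian martingale noise. None of these values come from the paper, and an uncontrolled chain is only a special case of the controlled setting. The stationary distribution of this chain is uniform, so the averaged drift is $-1.5x + 0.5$ and the iterates are expected to settle near $1/3$.

```python
import random


def run_linear_sa(num_steps, x0=0.0, seed=0):
    """Simulate x_{n+1} = x_n + a(n) * (A(y_n) x_n + b(y_n) + M_{n+1}).

    Illustrative assumptions: scalar x, two-state chain y in {0, 1},
    A(y) = -1 - y, b(y) = y, a(n) = 1/(n+1), Gaussian noise.
    """
    rng = random.Random(seed)
    x, y = x0, 0
    for n in range(num_steps):
        a_n = 1.0 / (n + 1)
        A_y = -1.0 - y               # A(y): negative in both states
        b_y = float(y)               # b(y)
        noise = rng.gauss(0.0, 0.1)  # martingale difference term
        x = x + a_n * (A_y * x + b_y + noise)
        if rng.random() < 0.3:       # symmetric chain: flip state w.p. 0.3
            y = 1 - y
    return x
```

A call such as `run_linear_sa(50000)` returns a value close to the root of the averaged drift.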
For a detailed exposition on TD algorithms the reader is referred to Tsitsiklis and Van Roy [10]. In this section, we impose conditions on $A$ and $b$ that guarantee the ‘stability and convergence’ of the iterates given by (15).

Remark: It is important to note that our TD algorithm (cf. (15)) is more general than the regular TD update with function approximation, as in (say) Tsitsiklis and Van Roy [10]. In particular, the regular TD with function approximation can be written (see [10]) as in (15). Note also that unlike the usual analyses of TD, we do not assume that the Markov process $\{y_n\}$ is (a) finite state and (b) ergodic under the given stationary policy.

We state the first of the two assumptions below. 

(Tl) A : S ^ and b : S are continuous maps. 

We show that (15) satisfies (A1)-(A3) and (S1) if it satisfies (T1). Since 
A and b are continuous maps, it follows that h is a jointly continuous map. 
Since A is continuous and S is compact, the range of A, $A(S) \subset \mathbb{R}^{d \times d}$, is 
compact. Define $L := \sup_{M \in A(S)} \|M\|$ $(< \infty)$. We have the following: 

$$\|h(x_1, y) - h(x_2, y)\| \le \|A(y)\| \times \|x_1 - x_2\| \le L\,\|x_1 - x_2\|.$$

Hence h is Lipschitz continuous in the first component with Lipschitz constant 
L; further, this constant does not change with the second component ((A1) is 
satisfied). Assumptions (A2) and (A3) are trivially satisfied, since they are 
imposed in (iii) and (ii) above. 
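
The Lipschitz bound can be checked numerically on a toy instance. The sketch below is only an illustration under stated assumptions: S = {0, 1}, d = 2, hand-picked matrices A(y) and vectors b(y) (none of them from the paper), and the infinity norm on vectors together with its induced operator norm (maximum absolute row sum). Any consistent pair of vector and operator norms would serve equally well.

```python
import random


def inf_norm_vec(v):
    """Infinity norm of a vector."""
    return max(abs(c) for c in v)


def inf_norm_mat(m):
    """Operator norm induced by the vector infinity norm: maximum absolute row sum."""
    return max(sum(abs(e) for e in row) for row in m)


def matvec(m, v):
    return [sum(e * c for e, c in zip(row, v)) for row in m]


# Hypothetical data for S = {0, 1} and d = 2.
A = {0: [[-1.0, 0.5], [0.0, -2.0]], 1: [[-3.0, 1.0], [0.2, -1.0]]}
b = {0: [1.0, 0.0], 1: [0.0, -1.0]}


def h(x, y):
    """h(x, y) = A(y) x + b(y)."""
    return [p + q for p, q in zip(matvec(A[y], x), b[y])]


# L := sup over the range of A of the operator norm (a maximum here, since S is finite).
L = max(inf_norm_mat(m) for m in A.values())


def lipschitz_holds(trials=1000, seed=0):
    """Check ||h(x1, y) - h(x2, y)|| <= L * ||x1 - x2|| on random pairs, for every y."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1 = [rng.uniform(-10.0, 10.0) for _ in range(2)]
        x2 = [rng.uniform(-10.0, 10.0) for _ in range(2)]
        gap = inf_norm_vec([p - q for p, q in zip(x1, x2)])
        for y in A:
            diff = inf_norm_vec([p - q for p, q in zip(h(x1, y), h(x2, y))])
            if diff > L * gap + 1e-9:  # small slack for floating-point rounding
                return False
    return True
```

Note that b cancels in the difference h(x1, y) - h(x2, y), which is why L depends on A alone.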

We have $h_c(x, y) = \{A(y)x + b(y)/c\}$ and $h_\infty(x, y) = \{A(y)x\}$. Now, we 
show that (S1) is satisfied. Let $c_n \uparrow \infty$, $y_n \to y$ and 
$\lim_{n \to \infty} h_{c_n}(x, y_n) = u$. We need to show that $u \in h_\infty(x, y)$. Since A is 
continuous, $\lim_{n \to \infty} A(y_n)x = A(y)x$. Since b is continuous on the compact set S, 
it is a bounded function, so $\lim_{n \to \infty} b(y_n)/c_n = 0$. Hence, we get 
$u = \lim_{n \to \infty} h_{c_n}(x, y_n) = A(y)x \in h_\infty(x, y)$. 

Before we state our second assumption we present an auxiliary result. 
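
The limit used in the verification of (S1) can be illustrated with a small numerical sketch. The choices below are assumptions for illustration only and do not come from the paper: d = 1, S = [0, 1], A(y) = -(1 + y), b(y) = cos(y), c_n = n and y_n = y + (1 - y)/n.

```python
import math


def A(y):
    return -(1.0 + y)       # continuous on S = [0, 1]


def b(y):
    return math.cos(y)      # continuous, hence bounded, on S


def h_c(x, y, c):
    """Scaled map h_c(x, y) = A(y) x + b(y) / c."""
    return A(y) * x + b(y) / c


def h_inf(x, y):
    """Limiting map h_infinity(x, y) = A(y) x."""
    return A(y) * x


def limit_gap(x, y, n):
    """Distance between h_{c_n}(x, y_n) and h_infinity(x, y) for the n-th term."""
    c_n = float(n)                 # c_n increases to infinity
    y_n = y + (1.0 - y) / n        # y_n converges to y and stays inside S
    return abs(h_c(x, y_n, c_n) - h_inf(x, y))


if __name__ == "__main__":
    for n in (1, 10, 100, 1000, 10000):
        print(n, limit_gap(2.0, 0.5, n))
```

The gap shrinks for two separate reasons, matching the two limits in the argument: A(y_n)x approaches A(y)x by continuity of A, and b(y_n)/c_n vanishes because b is bounded.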

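Lemma 7. Let $H : \mathbb{R}^d \to \{\text{subsets of } \mathbb{R}^d\}$ be a Marchaud map. Let $\mathcal{A}$ be an 
associated attracting set that is also Lyapunov stable. Let $B$ be a compact subset 
of the basin of attraction of $\mathcal{A}$. Then for all $\epsilon > 0$ there exists $T(\epsilon)$ such that 
$\Phi_{[T(\epsilon), +\infty)}(B) \subseteq N^{\epsilon}(\mathcal{A})$. 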

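Proof. Since $\mathcal{A}$ is Lyapunov stable, corresponding to $N^{\epsilon}(\mathcal{A})$ there exists $N^{\delta}(\mathcal{A})$ 
such that $\Phi_{[0, +\infty)}(N^{\delta}(\mathcal{A})) \subseteq N^{\epsilon}(\mathcal{A})$. Fix $x_0 \in B$. Since $B$ is contained in the 
basin of attraction of $\mathcal{A}$, $\exists\, t(x_0) > 0$ such that $\Phi_{t(x_0)}(x_0) \subseteq N^{\delta/4}(\mathcal{A})$. Further, 
from the upper semi-continuity of flow it follows that, for all $x \in N^{\delta(x_0)}(x_0)$, 
$\Phi_{t(x_0)}(x) \subseteq N^{\delta/4}(\Phi_{t(x_0)}(x_0))$ for some $\delta(x_0) > 0$, see Chapter 2 of Aubin 
and Cellina. Hence $\Phi_{t(x_0)}(x) \subseteq N^{\delta}(\mathcal{A})$ for all $x \in N^{\delta(x_0)}(x_0)$. Since $\mathcal{A}$ 
is Lyapunov stable, we get $\Phi_{[t(x_0), +\infty)}(x) \subseteq N^{\epsilon}(\mathcal{A})$. In this manner for each 
$x \in B$ we calculate $t(x)$ and $\delta(x)$; the collection $\{N^{\delta(x)}(x) : x \in B\}$ is an open 
cover for $B$. Let $\{N^{\delta(x_i)}(x_i) \mid 1 \le i \le m\}$ be a finite sub-cover. If we define 
$T(\epsilon) := \max\{t(x_i) \mid 1 \le i \le m\}$ then $\Phi_{[T(\epsilon), +\infty)}(B) \subseteq N^{\epsilon}(\mathcal{A})$. □ 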
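
As a sanity check on Lemma 7, consider a simple hypothetical instance that is not taken from the paper: the single-valued map $H(x) = \{-x\}$ with attracting set $\mathcal{A} = \{0\}$, compact set $B = \overline{B}_R(0)$ for some $R > 0$, and $N^{\epsilon}(\mathcal{A})$ read as the open $\epsilon$-neighborhood of $\mathcal{A}$. Under these assumptions the time $T(\epsilon)$ can be written down explicitly:

```latex
\begin{align*}
\Phi_t(x) &= \{e^{-t}x\}, \qquad t \ge 0, \\
\sup_{x \in B} \|e^{-t}x\| &= R\,e^{-t}, \\
R\,e^{-t} < \epsilon &\iff t > \ln(R/\epsilon), \\
T(\epsilon) := \max\{0, \ln(R/\epsilon)\} + 1
  &\;\Longrightarrow\; \Phi_{[T(\epsilon), +\infty)}(B) \subseteq N^{\epsilon}(\mathcal{A}).
\end{align*}
```

Here $T(\epsilon)$ depends on $B$ only through its radius $R$, which mirrors the uniformity over the compact set that the finite sub-cover provides in the proof.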

We have $H(x) = \overline{\mathrm{co}}(\{A(y)x \mid y \in S\})$. It follows from Lemma 2 that H is a 
Marchaud map. We state our second assumption below. 

(T2) Let $\epsilon > 0$ and $V : \overline{B}_{1+\epsilon}(0) \to [0, \infty)$. Let $\Lambda$ be a compact subset of 
$B_1(0)$; clearly $\sup_{u \in \Lambda} \|u\| < 1$. Let the following hold: 

(i) For all $t \ge 0$, $\Phi_t(B_{1+\epsilon}(0)) \subseteq B_{1+\epsilon}(0)$, where $\Phi_t(\cdot)$ is a solution to the DI 
$\dot{x}(t) \in H(x(t))$. 

(ii) $V^{-1}(0) = \Lambda$. 

(iii) V is continuous and, for all $x \in B_{1+\epsilon}(0) \setminus \Lambda$, $y \in \Phi_t(x)$ and 
$t > 0$, we have $V(y) < V(x)$. 

Proposition 3.25 from Benaim et al. [4]: Under (T2), $\Lambda$ is a Lyapunov 
stable attracting set; further, there exists an attractor, $\mathcal{A}$, contained in $\Lambda$ whose 
basin contains $B_{1+\epsilon}(0)$. 

Since $\mathcal{A} \subseteq \Lambda$ and $B_{1+\epsilon}(0)$ is contained in the basin of attraction of $\mathcal{A}$, it 
follows that $B_{1+\epsilon}(0)$ is contained in the basin of attraction of $\Lambda$. We have that 
$\overline{B}_1(0)$ is contained in some fundamental neighborhood of $\Lambda$ (Lemma 7). Further, 
$\sup_{z \in \Lambda} \|z\| < 1$. Hence (S2) is satisfied. Note that the attracting set associated with 
$\dot{x}(t) \in \overline{\mathrm{co}}(\{A(y)x(t) \mid y \in S\})$ in (S2) is $\Lambda$. 

Theorem 3. Under assumptions (A1)-(A3), (T1), (T2) and (B1)-(B3), 
almost surely the iterates given by (15) are stable and converge to an internally 
chain transitive invariant set associated with $\dot{x}(t) \in \hat{h}(x(t))$. 

Proof. We have shown that assumptions (A1)-(A3), (S1) & (S2) are satisfied 
by (15). It follows from Theorem 1 that the iterates are stable. Further, we have 
assumed that (B1)-(B3) are satisfied by (15). It follows from Theorem 2 that 
the iterates converge to an internally chain transitive invariant set associated 
with $\dot{x}(t) \in \hat{h}(x(t))$. □ 

Let us consider the special case when A is a constant map, i.e., $A(y) = M$ for 
all $y \in S$. Thus, we get the following recursion: 

$$x_{n+1} = x_n + a(n)\left[M x_n + b(y_n) + M_{n+1}\right], \tag{16}$$

where $M \in \mathbb{R}^{d \times d}$ and $b : S \to \mathbb{R}^d$ is a continuous map. Hence (16) satisfies 
assumption (T1). As explained before, (16) also satisfies (A1)-(A3) and (S1). 
It follows from the definition of $\{h_c\}_{c \ge 1}$ and $H$ that 

$$h_c(x, y) = Mx + b(y)/c \quad \text{and} \quad h_\infty(x, y) = Mx;$$

$$H(x) = \overline{\mathrm{co}}\Bigl(\bigcup_{y \in S} h_\infty(x, y)\Bigr) = Mx.$$

Note that the DI $\dot{x}(t) \in H(x(t))$ is really the o.d.e. $\dot{x}(t) = Mx(t)$, here. 

Let us assume that all eigenvalues of M have strictly negative real parts. 
Then, the origin is a globally asymptotically stable equilibrium point (a globally 
attracting set that is also Lyapunov stable) associated with $\dot{x}(t) = Mx(t)$ (see 
11.2.3 of Borkar [7]). Now, we show that recursion (16) satisfies assumption 
(T2). Solving $\dot{x}(t) = Mx(t)$, we get $\Phi_t(x(0)) = e^{Mt}x(0)$ for $t \ge 0$. Let $t > 0$ 
and $x(0) \in \mathbb{R}^d \setminus \{0\}$; we have that $\|\Phi_t(x(0))\| < \|x(0)\|$ since all the eigenvalues 
of M have strictly negative real parts (for the Euclidean norm this strict 
decrease is guaranteed when $M + M^{\top}$ is negative definite; in general it holds 
in the norm $\|x\|_P := \sqrt{x^{\top} P x}$, where $P$ is the positive definite solution of 
$M^{\top} P + P M = -I$). 

Let us define the following: 

1. $\epsilon := 1$. 

2. $V : \overline{B}_2(0) \to [0, \infty)$ as $V(x) := \|x\|$. 

3. $\Lambda := \{0\}$ (origin). 

As explained earlier, for $t > 0$ we have $\|\Phi_t(x)\| < \|x\|$, hence $\Phi_t(B_2(0)) \subseteq B_2(0)$ 
((T2)(i) holds). 

Recall that $V(x) = \|x\|$ for all $x \in \overline{B}_2(0)$. It follows from the definition of 
$V$ that $V^{-1}(0) = \Lambda$ ((T2)(ii) holds). 

Fix $x_0 \in B_2(0) \setminus \{0\}$ and $t > 0$; we have $\|\Phi_t(x_0)\| < \|x_0\|$, hence $V(\Phi_t(x_0)) < 
V(x_0)$ ((T2)(iii) holds). 
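
The decrease of V along the flow can be probed numerically. The sketch below rests on assumptions that are not part of the paper: a specific 2 x 2 matrix M with eigenvalues -1 and -2, chosen so that M + M^T is also negative definite, and an explicit Euler scheme standing in for the exact flow. The second property of M is what makes the Euclidean norm decrease monotonically; a matrix whose eigenvalues have negative real parts but which lacks it can show transient growth of the Euclidean norm, in which case a weighted norm would be the more natural choice of V.

```python
import math

# Hypothetical matrix: eigenvalues -1 and -2, and M + M^T is negative definite.
M = [[-1.0, 0.5], [0.0, -2.0]]


def flow_norms(x0, dt=1e-3, num_steps=5000):
    """Euclidean norms along an explicit Euler approximation of x'(t) = M x(t)."""
    x = list(x0)
    norms = [math.hypot(x[0], x[1])]
    for _ in range(num_steps):
        dx = [M[0][0] * x[0] + M[0][1] * x[1],
              M[1][0] * x[0] + M[1][1] * x[1]]
        x = [x[0] + dt * dx[0], x[1] + dt * dx[1]]
        norms.append(math.hypot(x[0], x[1]))
    return norms


def strictly_decreasing(values):
    """True if every entry is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


if __name__ == "__main__":
    norms = flow_norms([1.2, -1.5])   # starting point inside the ball of radius 2
    print("V decreases along the path:", strictly_decreasing(norms))
    print("V at start and end:", norms[0], norms[-1])
```

A path that starts inside the ball of radius 2 and has strictly decreasing norm never leaves that ball, which is the numerical counterpart of (T2)(i) and (T2)(iii) for this example.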

Since the recursion given by (16) satisfies (T1) & (T2), it follows from 
Theorem 3 that the iterates are ‘stable and convergent’. 
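
For illustration, recursion (16) can be simulated directly. The sketch below is written under assumptions that are not from the paper: d = 2, a hypothetical matrix M with eigenvalues -1 and -2, a two-point state space with an uncontrolled symmetric Markov chain, hand-picked vectors b(y), step size a(n) = 1/(n+1) and Gaussian martingale noise. For this choice the stationary distribution is (1/2, 1/2), the averaged value of b is (0.5, 1.0), and the point solving Mx + (0.5, 1.0) = 0 is (0.75, 0.5), so the iterates would be expected to settle near that point.

```python
import math
import random


def simulate_constant_matrix(num_steps=50_000, seed=1):
    """Simulate x_{n+1} = x_n + a(n) * (M x_n + b(y_n) + noise) in dimension 2."""
    rng = random.Random(seed)
    # Hypothetical data: M has eigenvalues -1 and -2.
    M = [[-1.0, 0.5], [0.0, -2.0]]
    b = {0: [1.0, 0.0], 1: [0.0, 2.0]}
    stay_prob = 0.7  # probability that y_n keeps its current state

    x, y = [1.5, -1.0], 0
    max_norm = math.hypot(x[0], x[1])
    for n in range(num_steps):
        step = 1.0 / (n + 1)                                   # a(n)
        noise = [rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)]     # martingale difference term
        drift = [M[i][0] * x[0] + M[i][1] * x[1] + b[y][i] + noise[i]
                 for i in range(2)]
        x = [x[i] + step * drift[i] for i in range(2)]
        max_norm = max(max_norm, math.hypot(x[0], x[1]))
        if rng.random() > stay_prob:                           # uncontrolled transition
            y = 1 - y
    return x, max_norm


if __name__ == "__main__":
    final_x, largest = simulate_constant_matrix()
    print("final iterate:", final_x)
    print("largest norm along the path:", largest)
```

The largest norm along the path gives a rough empirical view of stability, while the final iterate gives a view of convergence for this one sample path.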

6 Conclusions 

We presented in this paper general sufficient conditions for stability and 
convergence of stochastic approximation algorithms with ‘controlled Markov’ noise. 
To the best of our knowledge this is the first time that sufficient conditions for 
stability of stochastic approximations with ‘controlled Markov’ noise have been 
provided. We further studied an application of our results to a temporal 
difference learning algorithm and showed that the algorithm is stable and 
asymptotically convergent under weaker requirements than those in the other analyses in 
the literature. An interesting future direction would be to extend this analysis 
to the case of multi-timescale stochastic approximations that would encompass 
actor-critic algorithms, another important class of algorithms in reinforcement 
learning. 